Abstract. We present sensitivity analysis for results of query executions 
in a relational model of data extended by ordinal ranks. The underlying 
model of data results from the ordinary Codd's model of data in which 
we consider ordinal ranks of tuples in data tables expressing degrees to 
which tuples match queries. In this setting, we show that ranks assigned 
to tuples are insensitive to small changes, i.e., small changes in the input 
data do not yield large changes in the results of queries. 
1 Introduction 

Since its inception, the relational model of data introduced by E. Codd [10] 
has been extensively studied by both computer scientists and database systems 
developers. The model has become the standard theoretical model of relational 
data and the formal foundation for relational database management systems. 
Various reasons for the success and strong position of Codd's model are analyzed
in [14], where the author emphasizes that the main virtues of the model 
like logical and physical data independence, declarative style of data retrieval 
(database querying), access flexibility and data integrity are consequences of a 
close connection between the model and the first-order predicate logic. 

This paper is a continuation of our previous work [4,5], where we introduced
an extension of Codd's model in which tuples are assigned ordinal ranks. 
The motivation for the model is that in many situations, it is natural to con- 
sider not only the exact matches of queries in which a tuple of values either 
does or does not match a query Q but also approximate matches where tuples 
match queries to degrees. The degrees of approximate matches can usually be 
described verbally using linguistic modifiers like "not at all (matches)", "almost 
(matches)", "more or less (matches)", "fully (matches)", etc. From the user's 
point of view, each data table in our extended relational model consists of (i) 
an ordinary data table whose meaning is the same as in Codd's model and 
(ii) ranks assigned to all tuples in the original data table. This way, we come 
up with a notion of a ranked data table (shortly, an RDT). The ranks in RDTs 
are interpreted as "goodness of match" and the interpretation of RDTs is the 
same as in the Codd's model — they represent answers to queries which are, in 
addition, equipped with priorities expressed by the ranks. A user who looks at an 
answer to a query in our model is typically looking for the best match possible 
represented by a tuple or tuples in the resulting RDT with the highest ranks 
(i.e., highest priorities). 

In order to have a suitable formalization of ranks and to perform operations 
with ranked data tables, we have to choose a suitable structure for ranks. Since 
ranks are meant to be compared by users, the set L of all considered ranks should
be equipped with a partial order \(\le\), i.e., \(\langle L, \le\rangle\) should be a poset. Moreover, it
is convenient to postulate that \(\langle L, \le\rangle\) is a complete lattice [7], i.e., for each
subset \(A \subseteq L\), its least upper bound (a supremum) and greatest lower bound
(an infimum) exist. This way, for any \(A \subseteq L\), one can take the least rank in L
which represents a higher priority (a better match) than all ranks from A. Such
a rank is then the supremum of A (dually for the infimum). Since \(\langle L, \le\rangle\) is a
complete lattice, it contains the least element denoted 0 (no match at all) and
the greatest element denoted 1 (full match). 

The set L of all ranks should also be equipped with additional operations 
for aggregation of ranks. Indeed, if tuple t with rank a is obtained as one of
the results of subquery \(Q_1\) and the same t with another rank b is obtained from
answers to subquery \(Q_2\), then we might want to express the rank to which t
matches a compound conjunctive query "\(Q_1\) and \(Q_2\)". A natural way to do so
is to take a suitable binary operation \(\otimes\colon L \times L \to L\) which acts as a conjunctor
and take \(a \otimes b\) for the resulting rank. Obviously, not every binary operation on
L represents a (reasonable) conjunctor, i.e., we may restrict the choices only to
particular binary operations that make "good conjunctors". There are various
ways to impose such restrictions. In our model, we follow the approach of using
residuated conjunctions that has proved to be useful in logics based on residuated
lattices [2,18,19]. Namely, we assume that \(\langle L, \otimes, 1\rangle\) is a commutative monoid
(i.e., \(\otimes\) is associative, commutative, and neutral with respect to 1) and there is
a binary operation \(\rightarrow\) on L such that for all \(a, b, c \in L\): 

\[a \otimes b \le c \quad\text{if and only if}\quad a \le b \rightarrow c. \tag{1}\]

Operations \(\otimes\) (a multiplication) and \(\rightarrow\) (a residuum) satisfying (1) are called
adjoint operations. Altogether, the structure for ranks we use is a complete
residuated lattice \(\mathbf{L} = \langle L, \wedge, \vee, \otimes, \rightarrow, 0, 1\rangle\), i.e., a complete lattice in which
\(\otimes\) and \(\rightarrow\) are adjoint operations, and \(\wedge\) and \(\vee\) denote the operations of
infimum and supremum, respectively. Considering L as a basic structure of ranks
brings several benefits. First, in multiple-valued logics and in particular fuzzy
logics [18,19], residuated lattices are interpreted as structures of truth degrees
and the relationship (1) between \(\otimes\) (a fuzzy conjunction) and \(\rightarrow\) (a fuzzy
implication) is derived from requirements on a graded counterpart of the modus
ponens deduction rule (currently, there are many strongly complete logics based
on residuated lattices). 

Remark 1. The graded counterpart of modus ponens [19,26] can be seen as a
generalized deduction rule saying "from \(\varphi\) valid (at least) to degree \(a \in L\) and
\(\varphi \Rightarrow \psi\) valid (at least) to degree \(b \in L\), infer \(\psi\) valid (at least) to degree
\(a \otimes b\)". The if-part of (1) ensures that the rule is sound while the only-if part
ensures that it is as powerful as possible, i.e., \(a \otimes b\) is the highest degree to
which we infer \(\psi\) valid provided that \(\varphi\) is valid at least to degree a and
\(\varphi \Rightarrow \psi\) is valid at least to degree \(b \in L\). This relationship between \(\rightarrow\) (a truth
function for the logical connective implication) and \(\otimes\) was discovered in [17]
and later used, e.g., in [16,26]. Interestingly, (1) together with the lattice
ordering ensures enough properties of \(\rightarrow\) and \(\otimes\). For instance, \(\rightarrow\) is antitone
in the first argument and monotone in the second one, the condition \(a \le b\) iff
\(a \rightarrow b = 1\) holds for all \(a, b \in L\), \(a \rightarrow (b \rightarrow c)\) equals \((a \otimes b) \rightarrow c\) for all
\(a, b, c \in L\), etc. Since complete residuated lattices are in general weaker
structures than Boolean algebras, not all laws satisfied by truth functions of
the classic conjunction and implication are preserved by all complete residuated
lattices. For instance, neither \(a \otimes a = a\) (idempotency of \(\otimes\)) nor
\((a \rightarrow 0) \rightarrow 0 = a\) (the law of double negation) nor \(a \vee (a \rightarrow 0) = 1\) (the
law of the excluded middle) hold in general. Nevertheless, complete residuated
lattices are strong enough to provide a formal framework for relational analysis
and similarity-based reasoning, as has been shown by previous results. 
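
The properties in Remark 1 can be checked numerically. The following is a minimal sketch, assuming L = [0, 1] restricted to a finite grid and two standard choices of adjoint operations (the Łukasiewicz and Gödel operations); the function names are ours and the Gödel operations are not used elsewhere in this paper.

```python
from fractions import Fraction
from itertools import product

# Two standard pairs of adjoint operations on [0, 1]; names are illustrative.
def luk_mul(a, b):
    return max(a + b - 1, Fraction(0))

def luk_res(a, b):
    return min(1 - a + b, Fraction(1))

def godel_mul(a, b):
    return min(a, b)

def godel_res(a, b):
    return Fraction(1) if a <= b else b

# A finite grid of ranks 0, 1/10, ..., 1 (exact arithmetic, no rounding).
GRID = [Fraction(i, 10) for i in range(11)]

def is_adjoint(mul, res, grid=GRID):
    """Check (1): a (x) b <= c  iff  a <= b -> c, for all grid points."""
    return all((mul(a, b) <= c) == (a <= res(b, c))
               for a, b, c in product(grid, repeat=3))

half, zero, one = Fraction(1, 2), Fraction(0), Fraction(1)
adjoint_ok = is_adjoint(luk_mul, luk_res) and is_adjoint(godel_mul, godel_res)
# Idempotency fails for Lukasiewicz: 1/2 (x) 1/2 = 0.
idempotency_fails = luk_mul(half, half) != half
# Excluded middle fails for Lukasiewicz: 1/2 v (1/2 -> 0) = 1/2.
excluded_middle_fails = max(half, luk_res(half, zero)) != one
# Double negation fails for Goedel: (1/2 -> 0) -> 0 = 1.
double_negation_fails = godel_res(godel_res(half, zero), zero) != half
```

A grid check is of course not a proof; it only illustrates that (1) holds for these operations while the three classical laws can fail.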

Second, our extension of the Codd's model results from the model by re- 
placing the two-element Boolean algebra, which is the classic structure of truth 
values, by a more general structure of truth values represented by a residuated 
lattice, i.e. we make the following shift in (the semantics of) the underlying logic: 

two-element Boolean algebra \(\Longrightarrow\) a complete residuated lattice. 

Third, the original Codd's model is a special case of our model for L being the
two-element Boolean algebra (only two borderline ranks 1 and 0 are available).
As a practical consequence, data tables in Codd's model can be seen as RDTs
where all ranks are either equal to 1 (full match) or 0 (no match; tuples with
rank 0 are considered as not present in the result of a query). Using residuated 
lattices as structures of truth degrees, we obtain a generalization of Codd's 
model which is based on solid logical foundations and has desirable properties. 
In addition, its relationship to residuated first-order logics is the same as the 
relationship of the original Codd's model to the classic first-order logic. The 
formalization we offer can further be used to provide insight into several isolated 
approaches that have been provided in the past, see e.g. [5], [H], [13], [21], [lH]i 
[50] , and a comparison paper [B] . 

A typical choice of L is a structure with L = [0, 1] (ranks are taken from
the real unit interval), \(\wedge\) and \(\vee\) being minimum and maximum, \(\otimes\) being a
left-continuous (or a continuous) t-norm with the corresponding residuum \(\rightarrow\),
see [2,18,19]. For example, an RDT with ranks coming from such L is in Table 1.
It can be seen 



Table 1. Houses for sale at $200,000 with square footage 1200

rank  agent  id   sqft  age  location    price
0.93  Brown  138  1185  48   Vestal      $228,500
0.89  Clark  140  1120  30   Endicott    $235,800
0.86  Brown  142  950   50   Binghamton  $189,000
0.85  Brown  156  1300  85   Binghamton  $248,600
0.81  Clark  158  1200  25   Vestal      $293,500
0.81  Davis  189  1250  25   Binghamton  $287,300
0.75  Davis  166  1040  50   Vestal      $286,200
0.37  Davis  112  1890  30   Endicott    $345,000



as a result of similarity-based query "show all houses which are sold for (approx- 
imately) $200,000 and have (approximately) 1200 square feet". The left-most 
column contains ranks. The remaining part of the table is a data table in the 
usual sense containing tuples of values. At this point, we do not explain in detail 
how the particular ranks in Table 1 have been obtained (this will be outlined in
further sections). One way is by executing a similarity-based query that uses
additional information about similarity (proximity) of domain values which is also 
described using degrees from L. Note that the concept of a similarity-based query 
appears when human perception is involved in rating or comparing close values 
from domains where not only the exact equalities (matches) are interesting. For 
instance, a person searching in a database of houses is usually not interested in 
houses sold for a particular exact price. Instead, the person wishes to look at 
houses sold approximately at that price, including those which are sold for other 
prices that are sufficiently close. While the ranks constitute a "visible" part of 
any RDT, the similarities are not a direct part of RDT and have to be specified 
for each domain independently. They can be seen as an additional (background) 
information about domains which is supplied by users of the database system. 

Let us stress the meaning of ranks as priorities. As is usual in fuzzy logics
in the narrow sense, their meaning is primarily comparative, cf. [19, p. 2] and the
comments on comparative meaning of truth degrees therein. In our example,
it means that tuple (Clark, 140, 1120, 30, Endicott, $235,800) with rank 0.89
is a better match than tuple (Brown, 142, 950, 50, Binghamton, $189,000) whose
rank 0.86 is strictly smaller. Thus, for end-users, the numerical values of ranks
(if L is a unit interval) are not so important; the important thing is the relative
ordering of tuples given by the ranks. 

Note that our model, which provides theoretical foundations for similarity-based
databases [4,5], should not be confused with models for probabilistic
databases [2^ which have recently been studied, e.g., in [9,12,13,20,22,25], see
also [11] for a survey. In particular, numerical ranks used in our model (if
L = [0, 1]) cannot be interpreted as probabilities, confidence degrees, or belief
degrees as in the case of probabilistic databases where ranks play such roles. In
probabilistic databases, the tuples (i.e., the data itself) are uncertain and the
ranks express probabilities that tuples appear in data tables. Consequently, a
probabilistic database is formalized by a discrete probability space over the
possible contents of the database [11]. Nevertheless, the underlying logic of the
models is the classical two-valued first-order logic: only yes/no matches are
allowed (with uncertain outcome). In our case, the situation is quite different. The
data (represented by tuples) is absolutely certain but the tuples are allowed to
match queries to degrees. This, translated in terms of logic, means that formulas
(encoding queries) are allowed to be evaluated to truth degrees other than 0 and
1. Therefore, the underlying logic in our model is not the classic two-element
Boolean logic, as we have argued above. 

In [1], a report written by leading authorities in database systems, the authors 
say that the current database management systems have no facilities for either 
approximate data or imprecise queries. According to this report, the manage- 
ment of uncertainty and imprecision is one of the six currently most important 
research directions in database systems. Nowadays, probabilistic databases (deal- 
ing with approximate data) are extensively studied. On the contrary, it seems 
that similarity-based databases (dealing with imprecise queries) have not yet 
been paid full attention. This paper is a contribution to theoretical foundations 
of similarity-based databases. 

2 Problem Setting 

The issue we address in this paper is the following. In our model, we can get 
two or more RDTs (as results of queries) which are not exactly the same but 
which are perceived (by users) as being similar. For instance, one can obtain 
two RDTs containing the same tuples with numerical values of ranks that are 
almost the same. A question is whether such similar RDTs, when used in subse- 
quent queries, yield similar results. In this paper, we present a preliminary study 
of the phenomenon of similarity of RDTs and its relationship to the similarity 
of query results obtained by applying queries to similar input data tables. We 
present basic notions and results providing formulas for computing estimations 
of similarity degrees. The observations we present provide a formal justification 
for the phenomenon discussed in the previous section — slight changes in ranks 
do not have a large impact on the results of (complex) queries. The results are 
obtained for any complete residuated lattice taken as the structure of ranks 
(truth degrees). Note that the basic query systems in our model are (extensions 
of) domain relational calculus [5,24] and relational algebra [4,24]. We formulate 
the results in terms of operations of the relational algebra but due to its equiv- 
alence with the domain relational calculus [5], the results pertain to both the 
query systems. Thus, based on the domain relational calculus, one may design 
a declarative query language preserving similarity in which execution of queries 
is based on transformations to expressions of relational algebra in a similar way 
as in the classic case [24] . 

The rest of the paper is organized as follows. Section 3 presents a short
survey of notions. Section 4 contains results on sensitivity analysis, an illustrative 
example, and a short outline of future research. Because of the limited scope of 
the paper, proofs are sketched or omitted. 



3 Preliminaries 



In this section, we recall basic notions of RDTs and relational operations we 
need to provide insight into the sensitivity issues of RDTs in Section 4. Details
can be found in [2,4,6]. In the rest of the paper, L always refers to a complete
residuated lattice \(\mathbf{L} = \langle L, \wedge, \vee, \otimes, \rightarrow, 0, 1\rangle\), see Section 1.

3.1 Basic Structures 

Given L, we make use of the following notions: An L-set A in universe U is a
map \(A\colon U \to L\), \(A(u)\) being interpreted as "the degree to which u belongs to
A". If L is the two-element Boolean algebra, then \(A\colon U \to L\) is an indicator
function of a classic subset of U, \(A(u) = 1\) (\(A(u) = 0\)) meaning that u belongs
(does not belong) to that subset. In our approach, we tacitly identify sets with
their indicator functions. In a similar way, a binary L-relation B on U is a map
\(B\colon U \times U \to L\), \(B(u_1, u_2)\) interpreted as "the degree to which \(u_1\) and \(u_2\) are
related according to B". Hence, B is an L-set in universe \(U \times U\). 

3.2 Ranked Data Tables over Domains with Similarities 

We denote by Y a set of attributes; any subset \(R \subseteq Y\) is called a relation scheme.
For each attribute \(y \in Y\) we consider its domain \(D_y\). In addition, each \(D_y\) is
equipped with a binary L-relation \(\approx_y\) on \(D_y\) satisfying (i) reflexivity
(\(u \approx_y u = 1\)) and (ii) symmetry (\(u \approx_y v = v \approx_y u\)) for all \(u, v \in D_y\). Each
binary L-relation \(\approx_y\) on \(D_y\) satisfying (i) and (ii) shall be called a similarity.
The pair \(\langle D_y, \approx_y\rangle\) is called a domain with similarity. 

Tuples contained in data tables will be considered as usual, i.e., as elements
of Cartesian products of domains. Recall that a Cartesian product
\(\prod_{i \in I} D_i\) of an I-indexed system \(\{D_i \mid i \in I\}\) of sets \(D_i\) (\(i \in I\)) is the set of
all maps \(t\colon I \to \bigcup_{i \in I} D_i\) such that \(t(i) \in D_i\) holds for each \(i \in I\). Under
this notation, a tuple over \(R \subseteq Y\) is any element from \(\prod_{y \in R} D_y\). For
brevity, \(\prod_{y \in R} D_y\) is denoted by Tupl(R). Following the example in Table 1,
tuple (Brown, 142, 950, 50, Binghamton, $189,000) is a map \(r \in \mathrm{Tupl}(R)\) for
R = {agent, id, ..., price} such that r(agent) = Brown, r(id) = 142, etc. 

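A ranked data table on \(R \subseteq Y\) over \(\{\langle D_y, \approx_y\rangle \mid y \in R\}\) (shortly, an RDT)
is any (finite) L-set \(\mathcal{D}\) in Tupl(R). The degree \(\mathcal{D}(r)\) to which r belongs to \(\mathcal{D}\) is
called a rank of tuple r in \(\mathcal{D}\). According to its definition, if \(\mathcal{D}\) is an RDT on R
over \(\{\langle D_y, \approx_y\rangle \mid y \in R\}\) then \(\mathcal{D}\) is a map \(\mathcal{D}\colon \mathrm{Tupl}(R) \to L\). Note that \(\mathcal{D}\) is an
n-ary L-relation between domains \(D_y\) (\(y \in R\)) since \(\mathcal{D}\) is a map from \(\prod_{y \in R} D_y\)
to L. In our example, \(\mathcal{D}(r) = 0.86\) for r being the tuple with r(id) = 142. 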

3.3 Relational Operations with RDTs 

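Relational operations we consider in this paper are the following: For RDTs
\(\mathcal{D}_1\) and \(\mathcal{D}_2\) on T, we put \((\mathcal{D}_1 \cup \mathcal{D}_2)(t) = \mathcal{D}_1(t) \vee \mathcal{D}_2(t)\) and
\((\mathcal{D}_1 \cap \mathcal{D}_2)(t) = \mathcal{D}_1(t) \wedge \mathcal{D}_2(t)\) for each \(t \in \mathrm{Tupl}(T)\); \(\mathcal{D}_1 \cup \mathcal{D}_2\) and
\(\mathcal{D}_1 \cap \mathcal{D}_2\) are called the union and the \(\wedge\)-intersection of \(\mathcal{D}_1\) and \(\mathcal{D}_2\),
respectively. Analogously, one can define an \(\otimes\)-intersection \(\mathcal{D}_1 \otimes \mathcal{D}_2\). Hence,
\(\cup\), \(\cap\), and \(\otimes\) are defined componentwise based
on the operations of the complete residuated lattice L. 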
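
The componentwise definitions can be sketched in a few lines. This is only an illustration under our own assumptions: an RDT is stored as a dictionary from tuples to ranks (tuples with rank 0 are simply absent), L = [0, 1] with the Łukasiewicz multiplication, and the two small tables and all function names are hypothetical.

```python
from fractions import Fraction

def luk_mul(a, b):
    """Lukasiewicz multiplication on [0, 1]."""
    return max(a + b - 1, Fraction(0))

def combine(d1, d2, op):
    """Apply a binary rank operation componentwise; absent tuples have rank 0."""
    zero = Fraction(0)
    result = {}
    for t in set(d1) | set(d2):
        rank = op(d1.get(t, zero), d2.get(t, zero))
        if rank > 0:          # keep only tuples with a nonzero rank
            result[t] = rank
    return result

def union(d1, d2):
    return combine(d1, d2, max)       # supremum of ranks

def meet(d1, d2):
    return combine(d1, d2, min)       # infimum of ranks

def otimes(d1, d2):
    return combine(d1, d2, luk_mul)   # multiplication of ranks

# Hypothetical RDTs on the scheme (agent, id).
d1 = {("Brown", 138): Fraction(93, 100), ("Clark", 140): Fraction(89, 100)}
d2 = {("Brown", 138): Fraction(90, 100), ("Davis", 166): Fraction(75, 100)}
```

Dropping tuples of rank 0 mirrors the convention that such tuples are considered as not present in the result.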

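Moreover, our model admits new operations that are trivial in the classic
model. For instance, for \(a \in L\), we introduce an a-shift \(a \rightarrow \mathcal{D}\) of \(\mathcal{D}\) by
\((a \rightarrow \mathcal{D})(t) = a \rightarrow \mathcal{D}(t)\) for all \(t \in \mathrm{Tupl}(T)\). 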



Remark 2. Note that if L is the two-element Boolean algebra then an a-shift is a
trivial operation since \(1 \rightarrow \mathcal{D} = \mathcal{D}\) and \(0 \rightarrow \mathcal{D}\) produces a possibly infinite table
containing all tuples from Tupl(T). In our model, an a-shift has the following
meaning: If \(\mathcal{D}\) is a result of query Q then \((a \rightarrow \mathcal{D})(t)\) is a "degree to which t
matches query Q at least to degree a". This follows from properties of residuum,
see [2,19]. Hence, a-shifts allow us to emphasize results that match queries at
least to a prescribed degree a. 

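The remaining relational operations we consider represent counterparts of
projection, selection, and join in our model. If \(\mathcal{D}\) is an RDT on T, the projection
\(\pi_R(\mathcal{D})\) of \(\mathcal{D}\) onto \(R \subseteq T\) is defined by

\[(\pi_R(\mathcal{D}))(r) = \bigvee_{s \in \mathrm{Tupl}(T \setminus R)} \mathcal{D}(rs),\]

for each \(r \in \mathrm{Tupl}(R)\). In our example, the result of \(\pi_{\{location\}}(\mathcal{D})\) is a ranked
data table with a single column such that \(\pi_{\{location\}}(\mathcal{D})(\langle\text{Binghamton}\rangle) = 0.86\),
\(\pi_{\{location\}}(\mathcal{D})(\langle\text{Vestal}\rangle) = 0.93\), and \(\pi_{\{location\}}(\mathcal{D})(\langle\text{Endicott}\rangle) = 0.89\). 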
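
The three ranks above can be reproduced from Table 1. The sketch below assumes L = [0, 1] with supremum taken as maximum and stores the RDT as a dictionary from tuples to ranks; the representation and the function name `project` are our own.

```python
def project(table, scheme, onto):
    """Projection: the rank of r is the supremum (here max) of the ranks of
    all tuples in `table` that agree with r on the attributes in `onto`."""
    idx = [scheme.index(y) for y in onto]
    result = {}
    for t, rank in table.items():
        r = tuple(t[i] for i in idx)
        result[r] = max(result.get(r, 0), rank)
    return result

SCHEME = ("agent", "id", "sqft", "age", "location", "price")
# Tuples and ranks copied from Table 1 (prices as plain integers).
TABLE1 = {
    ("Brown", 138, 1185, 48, "Vestal", 228500): 0.93,
    ("Clark", 140, 1120, 30, "Endicott", 235800): 0.89,
    ("Brown", 142, 950, 50, "Binghamton", 189000): 0.86,
    ("Brown", 156, 1300, 85, "Binghamton", 248600): 0.85,
    ("Clark", 158, 1200, 25, "Vestal", 293500): 0.81,
    ("Davis", 189, 1250, 25, "Binghamton", 287300): 0.81,
    ("Davis", 166, 1040, 50, "Vestal", 286200): 0.75,
    ("Davis", 112, 1890, 30, "Endicott", 345000): 0.37,
}

by_location = project(TABLE1, SCHEME, ("location",))
```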

A similarity-based selection is a counterpart to ordinary selection which
selects from a data table all tuples which approximately match a given condition:
Let \(\mathcal{D}\) be an RDT on T and let \(y \in T\) and \(d \in D_y\). Then, a similarity-based
selection \(\sigma_{y \approx d}(\mathcal{D})\) of tuples in \(\mathcal{D}\) matching \(y \approx d\) is defined by

\[(\sigma_{y \approx d}(\mathcal{D}))(t) = \mathcal{D}(t) \otimes t(y) \approx_y d.\]

Considering \(\mathcal{D}\) as a result of query Q, the rank of t in \(\sigma_{y \approx d}(\mathcal{D})\) can be interpreted
as a degree to which "t matches the query Q and the y-value of t is similar to d".
In particular, an interesting case is \(\sigma_{p \approx q}(\mathcal{D})\) where p and q are both attributes
with a common domain with similarity. 
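
As a rough illustration of the selection formula, the sketch below uses the Łukasiewicz multiplication and a linear price similarity. The similarity `price_sim`, its scale of $100,000, and the two-attribute table are hypothetical choices made only for this example; the paper does not prescribe any particular similarity.

```python
from fractions import Fraction

def luk_mul(a, b):
    """Lukasiewicz multiplication on [0, 1]."""
    return max(a + b - 1, Fraction(0))

def price_sim(p, q, scale=100_000):
    """Hypothetical similarity of prices: 1 for equal prices, decreasing
    linearly to 0 at a difference of `scale` dollars."""
    return max(Fraction(0), 1 - Fraction(abs(p - q), scale))

def select(table, scheme, y, d, sim):
    """Similarity-based selection: new rank of t is rank(t) (x) (t(y) ~ d)."""
    i = scheme.index(y)
    result = {}
    for t, rank in table.items():
        new_rank = luk_mul(rank, sim(t[i], d))
        if new_rank > 0:      # tuples of rank 0 are treated as absent
            result[t] = new_rank
    return result

SCHEME = ("id", "price")
houses = {(138, 228500): Fraction(93, 100), (112, 345000): Fraction(37, 100)}
selected = select(houses, SCHEME, "price", 240000, price_sim)
```

With these assumed inputs, house 138 keeps a high rank because its price is close to $240,000, while house 112 drops out entirely.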

Similarity-based joins are considered as derived operations based on Cartesian
products and similarity-based selections. For \(r \in \mathrm{Tupl}(R)\) and \(s \in \mathrm{Tupl}(S)\)
such that \(R \cap S = \emptyset\), we define a concatenation \(rs \in \mathrm{Tupl}(R \cup S)\) of tuples r and
s so that \((rs)(y) = r(y)\) for \(y \in R\) and \((rs)(y) = s(y)\) for \(y \in S\). For RDTs \(\mathcal{D}_1\)
and \(\mathcal{D}_2\) on disjoint relation schemes S and T we define an RDT \(\mathcal{D}_1 \times \mathcal{D}_2\) on \(S \cup T\),
called a Cartesian product of \(\mathcal{D}_1\) and \(\mathcal{D}_2\), by \((\mathcal{D}_1 \times \mathcal{D}_2)(st) = \mathcal{D}_1(s) \otimes \mathcal{D}_2(t)\). Using
Cartesian products and similarity-based selections, we can introduce
similarity-based \(\theta\)-joins such as \(\mathcal{D}_1 \bowtie_{p \approx q} \mathcal{D}_2 = \sigma_{p \approx q}(\mathcal{D}_1 \times \mathcal{D}_2)\). Various other types of
similarity-based joins can be introduced in our model, see [5]. 



4 Estimations of Sensitivity of Query Results 



4.1 Rank-Based Similarity of Query Results 

We now introduce the notion of similarity of RDTs which is based on the idea
that RDTs \(\mathcal{D}_1\) and \(\mathcal{D}_2\) (on the same relation scheme) are similar iff for each tuple
t, ranks \(\mathcal{D}_1(t)\) and \(\mathcal{D}_2(t)\) are similar (degrees from L). Similarity of ranks can
be expressed by biresiduum \(\leftrightarrow\) (a fuzzy equivalence [2,18,19]) which is a derived
operation of L such that \(a \leftrightarrow b = (a \rightarrow b) \wedge (b \rightarrow a)\). Since we are interested
in similarity of \(\mathcal{D}_1(t)\) and \(\mathcal{D}_2(t)\) for all possible tuples t, it is straightforward to
define the similarity \(E(\mathcal{D}_1, \mathcal{D}_2)\) of \(\mathcal{D}_1\) and \(\mathcal{D}_2\) by an infimum which goes over all
tuples:

\[E(\mathcal{D}_1, \mathcal{D}_2) = \bigwedge_{t \in \mathrm{Tupl}(T)} \bigl(\mathcal{D}_1(t) \leftrightarrow \mathcal{D}_2(t)\bigr). \tag{2}\]

An alternative (but equivalent) way is the following: we first formalize a degree
\(S(\mathcal{D}_1, \mathcal{D}_2)\) to which \(\mathcal{D}_1\) is included in \(\mathcal{D}_2\). We can say that \(\mathcal{D}_1\) is fully included in
\(\mathcal{D}_2\) iff, for each tuple t, the rank \(\mathcal{D}_2(t)\) is at least as high as the rank \(\mathcal{D}_1(t)\). Notice
that in the classic (two-valued) case, this is exactly how one defines the ordinary
subsethood relation "\(\subseteq\)". Considering general degrees of inclusion (subsethood),
a degree \(S(\mathcal{D}_1, \mathcal{D}_2)\) to which \(\mathcal{D}_1\) is included in \(\mathcal{D}_2\) can be defined as follows:

\[S(\mathcal{D}_1, \mathcal{D}_2) = \bigwedge_{t \in \mathrm{Tupl}(T)} \bigl(\mathcal{D}_1(t) \rightarrow \mathcal{D}_2(t)\bigr). \tag{3}\]

It is easy to prove [2] that (2) and (3) satisfy:

\[E(\mathcal{D}_1, \mathcal{D}_2) = S(\mathcal{D}_1, \mathcal{D}_2) \wedge S(\mathcal{D}_2, \mathcal{D}_1). \tag{4}\]

Note that E and S defined by (2) and (3) are known as degrees of similarity
and subsethood from general fuzzy relational systems [2] (in this case, the fuzzy
relations are RDTs). 
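
Definitions (2) and (3) and identity (4) can be illustrated on small inputs. The sketch assumes the Łukasiewicz residuum on [0, 1], stores RDTs as dictionaries with absent tuples having rank 0, and uses two made-up tables; all names are ours.

```python
from fractions import Fraction

def luk_res(a, b):
    """Lukasiewicz residuum on [0, 1]."""
    return min(1 - a + b, Fraction(1))

def subsethood(d1, d2):
    """S(D1, D2) from (3). Tuples outside both tables contribute
    0 -> 0 = 1, so iterating over the stored tuples is enough."""
    zero = Fraction(0)
    keys = set(d1) | set(d2)
    return min((luk_res(d1.get(t, zero), d2.get(t, zero)) for t in keys),
               default=Fraction(1))

def similarity(d1, d2):
    """E(D1, D2) from (2), using the biresiduum (a -> b) ^ (b -> a)."""
    zero = Fraction(0)
    keys = set(d1) | set(d2)
    return min((min(luk_res(d1.get(t, zero), d2.get(t, zero)),
                    luk_res(d2.get(t, zero), d1.get(t, zero))) for t in keys),
               default=Fraction(1))

# Hypothetical RDTs; "t1", "t2", "t3" stand for arbitrary tuples.
d1 = {"t1": Fraction(9, 10), "t2": Fraction(1, 2)}
d2 = {"t1": Fraction(7, 10), "t2": Fraction(4, 5), "t3": Fraction(1, 10)}
```

For these inputs the degree of inclusion is not symmetric, and the similarity equals the smaller of the two inclusion degrees, as (4) states.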

The following assertion shows that \(\cup\), \(\cap\), \(\otimes\), and a-shifts preserve subsethood
degrees given by (3). In words, the degree to which \(\mathcal{D}_1 \cup \mathcal{D}_2\) is included in \(\mathcal{D}'_1 \cup \mathcal{D}'_2\)
is at least as high as the degree to which \(\mathcal{D}_1\) is included in \(\mathcal{D}'_1\) and \(\mathcal{D}_2\) is included
in \(\mathcal{D}'_2\). A similar verbal description can be made for the other operations. 

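Theorem 1. For any \(\mathcal{D}_1\), \(\mathcal{D}'_1\), \(\mathcal{D}_2\), and \(\mathcal{D}'_2\) on relation scheme T,

\[S(\mathcal{D}_1, \mathcal{D}'_1) \wedge S(\mathcal{D}_2, \mathcal{D}'_2) \le S(\mathcal{D}_1 \cup \mathcal{D}_2, \mathcal{D}'_1 \cup \mathcal{D}'_2), \tag{5}\]
\[S(\mathcal{D}_1, \mathcal{D}'_1) \wedge S(\mathcal{D}_2, \mathcal{D}'_2) \le S(\mathcal{D}_1 \cap \mathcal{D}_2, \mathcal{D}'_1 \cap \mathcal{D}'_2), \tag{6}\]
\[S(\mathcal{D}_1, \mathcal{D}'_1) \otimes S(\mathcal{D}_2, \mathcal{D}'_2) \le S(\mathcal{D}_1 \otimes \mathcal{D}_2, \mathcal{D}'_1 \otimes \mathcal{D}'_2), \tag{7}\]
\[S(\mathcal{D}_1, \mathcal{D}_2) \le S(a \rightarrow \mathcal{D}_1, a \rightarrow \mathcal{D}_2). \tag{8}\]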



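Proof (sketch). (5): Using adjointness, it suffices to check that
\((S(\mathcal{D}_1, \mathcal{D}'_1) \wedge S(\mathcal{D}_2, \mathcal{D}'_2)) \otimes (\mathcal{D}_1 \cup \mathcal{D}_2)(t) \le (\mathcal{D}'_1 \cup \mathcal{D}'_2)(t)\) holds true for any
\(t \in \mathrm{Tupl}(T)\). Using (3), the monotony of \(\otimes\) and \(\wedge\) yields
\((S(\mathcal{D}_1, \mathcal{D}'_1) \wedge S(\mathcal{D}_2, \mathcal{D}'_2)) \otimes (\mathcal{D}_1 \cup \mathcal{D}_2)(t) \le ((\mathcal{D}_1(t) \rightarrow \mathcal{D}'_1(t)) \wedge (\mathcal{D}_2(t) \rightarrow \mathcal{D}'_2(t))) \otimes (\mathcal{D}_1(t) \vee \mathcal{D}_2(t))\).
Applying \(a \otimes (b \vee c) = (a \otimes b) \vee (a \otimes c)\) to the latter expression, we get
\(((\mathcal{D}_1(t) \rightarrow \mathcal{D}'_1(t)) \wedge (\mathcal{D}_2(t) \rightarrow \mathcal{D}'_2(t))) \otimes (\mathcal{D}_1(t) \vee \mathcal{D}_2(t)) \le ((\mathcal{D}_1(t) \rightarrow \mathcal{D}'_1(t)) \otimes \mathcal{D}_1(t)) \vee ((\mathcal{D}_2(t) \rightarrow \mathcal{D}'_2(t)) \otimes \mathcal{D}_2(t))\).
Using \(a \otimes (a \rightarrow b) \le b\) twice, it follows that
\(((\mathcal{D}_1(t) \rightarrow \mathcal{D}'_1(t)) \otimes \mathcal{D}_1(t)) \vee ((\mathcal{D}_2(t) \rightarrow \mathcal{D}'_2(t)) \otimes \mathcal{D}_2(t)) \le \mathcal{D}'_1(t) \vee \mathcal{D}'_2(t)\).
Putting the previous inequalities together,
\((S(\mathcal{D}_1, \mathcal{D}'_1) \wedge S(\mathcal{D}_2, \mathcal{D}'_2)) \otimes (\mathcal{D}_1 \cup \mathcal{D}_2)(t) \le (\mathcal{D}'_1 \cup \mathcal{D}'_2)(t)\), which proves (5).
(6) can be proved analogously as (5); (7) can be proved analogously as (5) using
monotony of \(\otimes\); (8) follows from the fact that \(a \rightarrow b \le (c \rightarrow a) \rightarrow (c \rightarrow b)\). □ 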
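
Inequalities (5) to (8) can also be spot-checked on random inputs. The sketch below is only a sanity check under our own assumptions: the Łukasiewicz operations on [0, 1], RDTs represented as lists of ranks over one fixed finite set of tuples, and ranks drawn from a grid of exact fractions. It does not replace the proof.

```python
import random
from fractions import Fraction

def mul(a, b):
    return max(a + b - 1, Fraction(0))     # Lukasiewicz multiplication

def res(a, b):
    return min(1 - a + b, Fraction(1))     # Lukasiewicz residuum

def S(d1, d2):
    """Subsethood degree (3) over a fixed finite set of tuples."""
    return min(res(a, b) for a, b in zip(d1, d2))

def pointwise(op, d1, d2):
    return [op(a, b) for a, b in zip(d1, d2)]

def random_rdt(rng, n=6):
    """Ranks of n tuples, drawn from the grid 0, 1/20, ..., 1."""
    return [Fraction(rng.randint(0, 20), 20) for _ in range(n)]

def theorem1_holds(rng):
    d1, d1p, d2, d2p = (random_rdt(rng) for _ in range(4))
    a = Fraction(rng.randint(0, 20), 20)
    s1, s2 = S(d1, d1p), S(d2, d2p)
    shift = lambda d: [res(a, x) for x in d]   # a-shift
    return (min(s1, s2) <= S(pointwise(max, d1, d2), pointwise(max, d1p, d2p))      # (5)
            and min(s1, s2) <= S(pointwise(min, d1, d2), pointwise(min, d1p, d2p))  # (6)
            and mul(s1, s2) <= S(pointwise(mul, d1, d2), pointwise(mul, d1p, d2p))  # (7)
            and S(d1, d2) <= S(shift(d1), shift(d2)))                               # (8)

rng = random.Random(0)
all_hold = all(theorem1_holds(rng) for _ in range(1000))
```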

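Using (4), we have the following consequence of Theorem 1:

Corollary 1. For \(\diamond\) being \(\cap\) and \(\cup\), we have:

\[E(\mathcal{D}_1, \mathcal{D}'_1) \wedge E(\mathcal{D}_2, \mathcal{D}'_2) \le E(\mathcal{D}_1 \diamond \mathcal{D}_2, \mathcal{D}'_1 \diamond \mathcal{D}'_2). \tag{9}\]

\[E(\mathcal{D}_1, \mathcal{D}'_1) \otimes E(\mathcal{D}_2, \mathcal{D}'_2) \le E(\mathcal{D}_1 \otimes \mathcal{D}_2, \mathcal{D}'_1 \otimes \mathcal{D}'_2). \tag{10}\]

\[E(\mathcal{D}_1, \mathcal{D}_2) \le E(a \rightarrow \mathcal{D}_1, a \rightarrow \mathcal{D}_2). \tag{11}\]

Proof (sketch). For \(\diamond\) being \(\cap\), (6) applied twice yields:
\(S(\mathcal{D}_1, \mathcal{D}'_1) \wedge S(\mathcal{D}_2, \mathcal{D}'_2) \le S(\mathcal{D}_1 \cap \mathcal{D}_2, \mathcal{D}'_1 \cap \mathcal{D}'_2)\) and
\(S(\mathcal{D}'_1, \mathcal{D}_1) \wedge S(\mathcal{D}'_2, \mathcal{D}_2) \le S(\mathcal{D}'_1 \cap \mathcal{D}'_2, \mathcal{D}_1 \cap \mathcal{D}_2)\). Hence,
(9) for \(\cap\) follows using (2). The rest is analogous. □ 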

Using the idea in the proof of Corollary 1, in order to prove that an operation
O preserves similarity, it suffices to check that O preserves (graded) subsethood.
Thus, from now on, we shall only investigate whether operations preserve
subsethood. In case of Cartesian products, we have:

Theorem 2. Let \(\mathcal{D}_1\) and \(\mathcal{D}'_1\) be RDTs on relation scheme S and let \(\mathcal{D}_2\) and \(\mathcal{D}'_2\)
be RDTs on relation scheme T such that \(S \cap T = \emptyset\). Then,

\[S(\mathcal{D}_1, \mathcal{D}'_1) \otimes S(\mathcal{D}_2, \mathcal{D}'_2) \le S(\mathcal{D}_1 \times \mathcal{D}_2, \mathcal{D}'_1 \times \mathcal{D}'_2). \tag{12}\]

Proof (sketch). The proof is analogous to that of (7). □ 

The following assertion shows that projection and similarity-based selection 
preserve subsethood degrees (and therefore similarities) of RDTs: 

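Theorem 3. Let \(\mathcal{D}\) and \(\mathcal{D}'\) be RDTs on relation scheme T and let \(y \in T\),
\(d \in D_y\), and \(R \subseteq T\). Then,

\[S(\mathcal{D}, \mathcal{D}') \le S(\pi_R(\mathcal{D}), \pi_R(\mathcal{D}')), \tag{13}\]
\[S(\mathcal{D}, \mathcal{D}') \le S(\sigma_{y \approx d}(\mathcal{D}), \sigma_{y \approx d}(\mathcal{D}')). \tag{14}\]

Proof (sketch). In order to prove (13), we check \(S(\mathcal{D}, \mathcal{D}') \otimes (\pi_R(\mathcal{D}))(r) \le (\pi_R(\mathcal{D}'))(r)\)
for any \(r \in \mathrm{Tupl}(R)\). It means showing that

\[S(\mathcal{D}, \mathcal{D}') \otimes \bigvee_{s \in \mathrm{Tupl}(T \setminus R)} \mathcal{D}(rs) \le (\pi_R(\mathcal{D}'))(r).\]

Thus, it suffices to prove \(S(\mathcal{D}, \mathcal{D}') \otimes \mathcal{D}(rs) \le (\pi_R(\mathcal{D}'))(r)\) for all
\(s \in \mathrm{Tupl}(T \setminus R)\). Using monotony of \(\otimes\), we get
\(S(\mathcal{D}, \mathcal{D}') \otimes \mathcal{D}(rs) \le (\mathcal{D}(rs) \rightarrow \mathcal{D}'(rs)) \otimes \mathcal{D}(rs) \le \mathcal{D}'(rs)\), because \(rs \in \mathrm{Tupl}(T)\).
Therefore, \(S(\mathcal{D}, \mathcal{D}') \otimes \mathcal{D}(rs) \le \mathcal{D}'(rs) \le \bigvee_{s' \in \mathrm{Tupl}(T \setminus R)} \mathcal{D}'(rs') = (\pi_R(\mathcal{D}'))(r)\),
which proves (13). In case of (14), we proceed analogously. □ 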



Table 2. Alternative ranks for houses for sale from Table 1

rank  agent  id   sqft  age  location    price
0.93  Brown  138  1185  48   Vestal      $228,500
0.91  Clark  140  1120  30   Endicott    $235,800
0.87  Brown  156  1300  85   Binghamton  $248,600
0.85  Brown  142  950   50   Binghamton  $189,000
0.82  Davis  189  1250  25   Binghamton  $287,300
0.79  Clark  158  1200  25   Vestal      $293,500
0.75  Davis  166  1040  50   Vestal      $286,200
0.37  Davis  112  1890  30   Endicott    $345,000



Theorem 2 and Theorem 3 used together yield

Corollary 2. Let \(\mathcal{D}_1\) and \(\mathcal{D}'_1\) be RDTs on relation scheme S and let \(\mathcal{D}_2\) and \(\mathcal{D}'_2\)
be RDTs on relation scheme T such that \(S \cap T = \emptyset\). Then,

\[S(\mathcal{D}_1, \mathcal{D}'_1) \otimes S(\mathcal{D}_2, \mathcal{D}'_2) \le S(\mathcal{D}_1 \bowtie_{p \approx q} \mathcal{D}_2, \mathcal{D}'_1 \bowtie_{p \approx q} \mathcal{D}'_2), \tag{15}\]

for any \(p \in S\) and \(q \in T\) having the same domain with similarity. □

As a result, we have shown that important relational operations in our model
(including similarity-based joins) preserve similarity defined by (2). Thus, we
have provided a formal justification for the (intuitively expected but nontrivial)
fact that similar input data yield similar results of queries. 

Remark 3. In this paper, we have restricted ourselves only to a fragment of 
relational operations in our model. In [5], we have shown that in order to have 
a relational algebra whose expressive power is the same as the expressive power 
of the domain relational calculus, we have to consider additional operations of 
residuum (defined componentwise using \(\rightarrow\)) and division. Nevertheless, these
two additional operations preserve E as well; this can be shown using
arguments similar to those in the proof of Theorem 1. As a consequence, the similarity is
preserved by all queries that can be formulated in DRC [5]. 

4.2 Illustrative Example 

Consider again the RDT from Table 1. The RDT can be seen as a result of
querying a database of houses for sale where one wants to find a house which
is sold for (approximately) $200,000 and has (approximately) 1200 square
feet. The attributes in the RDT are: real estate agent name (agent), house
ID (id), square footage (sqft), house age (age), house location (location),
and house price (price). In this example, the complete residuated lattice
\(\mathbf{L} = \langle L, \wedge, \vee, \otimes, \rightarrow, 0, 1\rangle\) serving as the structure of ranks will be the so-called
Łukasiewicz algebra [2,18,19]. That is, L = [0, 1], \(\wedge\) and \(\vee\) are minimum and
maximum, respectively, and the multiplication and residuum are defined as follows:
\(a \otimes b = \max(a + b - 1, 0)\) and \(a \rightarrow b = \min(1 - a + b, 1)\) for all \(a, b \in L\). 

Intuitively, it is natural to consider similarity of values in domains of sqft, 
age, location, and price. For instance, similarity of prices can be defined by
\(p_1 \approx_{price} p_2 = s(|p_2 - p_1|)\) using an antitone scaling function \(s\colon [0, \infty) \to [0, 1]\)
with \(s(0) = 1\) (i.e., identical prices are fully similar). Analogously, a similarity of 
locations can be defined based on their geographical distance and/or based on 
their evaluation (safety, school districts, . . . ) by an expert. In contrast, there is 
no need to have similarities for id and agents because end-users do not look for 
houses based on (similarity of) their (internal) IDs which are kept as keys merely 
because of performance reasons. Obviously, there may be various reasonable 
similarity relations defined for the above-mentioned domains and their careful 
choice is an important task. In this paper, we neither explain nor recommend 
particular ways to do so because (i) we try to keep a general view of the problem 
and (ii) similarities on domains are purpose and user dependent. 

Consider now the RDT in Table 2 defined over the same relation scheme as
the RDT in Table 1. These two RDTs can be seen as two (slightly different)
answers to the same query (when, e.g., the domain similarities have been slightly
changed) or answers to a modified query (e.g., "show all houses which are sold for
(approximately) $210,000 and ..."). The similarity of both the RDTs given by
(2) is 0.98 (very high). The results in the previous section say that if we perform
any (arbitrarily complex) query (using the relational operations we consider in
this paper) with Table 2 instead of Table 1, the results will be similar at least
to degree 0.98. 
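
The degree 0.98 can be recomputed directly from the ranks in Table 1 and Table 2 using (2) with the Łukasiewicz residuum. In the sketch below the ranks are keyed by house id, which is enough here because both tables contain the same tuples; the function names are ours.

```python
from fractions import Fraction

def res(a, b):
    return min(1 - a + b, Fraction(1))     # Lukasiewicz residuum

def bires(a, b):
    return min(res(a, b), res(b, a))       # biresiduum

# Ranks keyed by house id, in hundredths, copied from Table 1 and Table 2.
TABLE1 = {138: 93, 140: 89, 142: 86, 156: 85, 158: 81, 189: 81, 166: 75, 112: 37}
TABLE2 = {138: 93, 140: 91, 156: 87, 142: 85, 189: 82, 158: 79, 166: 75, 112: 37}

def similarity(t1, t2):
    """E from (2); tuples absent from both tables contribute 1 and are skipped."""
    keys = set(t1) | set(t2)
    return min(bires(Fraction(t1.get(k, 0), 100), Fraction(t2.get(k, 0), 100))
               for k in keys)

E = similarity(TABLE1, TABLE2)
```

In the Łukasiewicz algebra the biresiduum of a and b equals 1 - |a - b|, so E is determined by the largest rank difference, which is 0.02 (for houses 140, 156, and 158).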



Table 3. Join of Table 1 and the table of customers

rank  agent  id   price     name   budget
0.91  Brown  138  $228,500  Grant  $240,000
0.89  Brown  138  $228,500  Evans  $250,000
0.89  Brown  138  $228,500  Finch  $210,000
0.88  Clark  140  $235,800  Grant  $240,000
0.86  Clark  140  $235,800  Evans  $250,000
0.84  Brown  156  $248,600  Evans  $250,000
0.16  Davis  112  $345,000  Grant  $240,000
0.10  Davis  112  $345,000  Finch  $210,000



For illustration, consider an additional RDT of customers over a relation
scheme containing two attributes: name (customer name) and budget (price
the customer is willing to pay for a house). In particular, let (Evans, $250,000),
(Finch, $210,000), and (Grant, $240,000) be the only tuples in the RDT (all
with ranks 1). The answer to the following query

\[\pi_{\{agent, id, price, name, budget\}}(\mathcal{D}_1 \bowtie_{price \approx budget} \mathcal{D}_c),\]

where \(\mathcal{D}_1\) stands for Table 1 and \(\mathcal{D}_c\) stands for the RDT of customers, is in
Table 3 (for brevity, some records are omitted). The RDT thus represents an
answer to query "show deals for houses sold for (approximately) $200,000 with
(approximately) 1200 square feet and customers so that their budget is similar
to the house price". Furthermore, we can obtain an RDT of best agent-customer 



Table 4. Results of agent-customer matching for Table 1 (left) and Table 2 (right)

rank  agent  name        rank  agent  name
0.91  Brown  Grant       0.91  Brown  Grant
0.89  Brown  Evans       0.90  Clark  Grant
0.89  Brown  Finch       0.89  Brown  Evans
0.88  Clark  Grant       0.89  Brown  Finch
0.86  Clark  Evans       0.88  Clark  Evans
0.84  Clark  Finch       0.86  Clark  Finch
0.74  Davis  Evans       0.75  Davis  Evans
0.72  Davis  Grant       0.73  Davis  Grant
0.66  Davis  Finch       0.67  Davis  Finch



matching if we project the join onto agent and name: 

\[ \pi_{\{agent, name\}}(\mathcal{D}_1 \bowtie_{price \approx budget} \mathcal{D}_c). \]

The result of matching is in Table 4 (left). Due to our results, if we perform the 
same query with Table 2 instead of Table 1, the new result is guaranteed to be 
similar to the obtained result at least to degree 0.98. The result for Table 2 is 
shown in Table 4 (right). 
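
The join-then-project query above can be illustrated by a minimal Python sketch. It rests on several assumptions that are not taken from this paper: ranks live in [0, 1], ⊗ is the Gödel conjunction (minimum), the rank of a joined tuple is the ⊗-combination of both input ranks and the similarity degree, projection takes the supremum of ranks, and `price_sim` is a hypothetical linear similarity. The houses and their ranks are invented for illustration; only the three customers come from the text.

```python
from itertools import product

def t_norm(a, b):
    # Goedel conjunction (minimum); an assumed choice for the operation ⊗.
    return min(a, b)

def price_sim(a, b, scale=100_000):
    # Hypothetical similarity of prices: 1 for equal values, falling linearly to 0.
    return max(0.0, 1.0 - abs(a - b) / scale)

def sim_join(d1, d2, p, q, sim):
    # Assumed rank of a joined tuple rs: D1(r) ⊗ D2(s) ⊗ (r(p) ≈ s(q)).
    joined = {}
    for (r, rank_r), (s, rank_s) in product(d1.items(), d2.items()):
        degree = sim(dict(r)[p], dict(s)[q])
        rank = t_norm(t_norm(rank_r, rank_s), degree)
        if rank > 0:
            joined[r + s] = rank
    return joined

def project(d, attrs):
    # Assumed rank of a projected tuple: maximum over all tuples that extend it.
    projected = {}
    for t, rank in d.items():
        key = tuple((a, v) for a, v in t if a in attrs)
        projected[key] = max(projected.get(key, 0.0), rank)
    return projected

# Hypothetical houses with made-up ranks; a tuple is a tuple of (attribute, value) pairs.
houses = {
    (("agent", "A"), ("id", 1), ("price", 230_000)): 0.9,
    (("agent", "A"), ("id", 2), ("price", 260_000)): 0.7,
    (("agent", "B"), ("id", 3), ("price", 300_000)): 0.8,
}
# Customers from the text, all with rank 1.
customers = {
    (("name", "Evans"), ("budget", 250_000)): 1.0,
    (("name", "Finch"), ("budget", 210_000)): 1.0,
    (("name", "Grant"), ("budget", 240_000)): 1.0,
}

deals = sim_join(houses, customers, "price", "budget", price_sim)
matching = project(deals, {"agent", "name"})
for pair, rank in sorted(matching.items(), key=lambda kv: -kv[1]):
    print(round(rank, 2), dict(pair))
```

Because agent A offers two houses in this toy data, the projection keeps, for each agent-customer pair, the best rank among that agent's deals.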



4.3 Tuple-Based Similarity and Further Topics 

While the rank-based similarity from Section 4.1 can be sufficient in many cases, 
there are situations where one wants to consider a similarity of RDTs based on 
ranks and (pairwise) similarity of tuples. For instance, if we take the RDT from 
Table 1 and make a new one by taking all tuples (keeping their ranks) and 
increasing the prices by one dollar, we will come up with an RDT which is, 
according to rank-based similarity, very different from the original one. Intuitively, 
one would expect to have a high degree of similarity of the RDTs because they 
differ only by a slight change in price. This issue can be solved by considering 
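the following tuple-based degree of inclusion: 

\[ S^{\approx}(\mathcal{D}_1, \mathcal{D}_2) = \bigwedge_{t \in \mathrm{Tupl}(T)} \Bigl( \mathcal{D}_1(t) \rightarrow \bigvee_{t' \in \mathrm{Tupl}(T)} \bigl( \mathcal{D}_2(t') \otimes (t \approx t') \bigr) \Bigr), \tag{16} \]

where \(t \approx t' = \bigwedge_{y \in T} t(y) \approx_y t'(y)\) is a similarity of tuples \(t\) and \(t'\) over \(T\), cf. [6]. 
In a similar way, we may define \(E^{\approx}\) using \(S^{\approx}\) instead of \(S\). 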

Remark 4. By an easy inspection, \(S(\mathcal{D}_1, \mathcal{D}_2) \leq S^{\approx}(\mathcal{D}_1, \mathcal{D}_2)\), i.e. (16) yields an 
estimate which is at least as high as (3), and analogously for \(E\) and \(E^{\approx}\). Note 
that (16) has a natural meaning. Indeed, \(S^{\approx}(\mathcal{D}_1, \mathcal{D}_2)\) can be understood as a 
degree to which the following statement is true: "If \(t\) belongs to \(\mathcal{D}_1\), then there 
is \(t'\) which is similar to \(t\) and which belongs to \(\mathcal{D}_2\)". Hence, \(E^{\approx}(\mathcal{D}_1, \mathcal{D}_2)\) is a 
degree to which for each tuple from \(\mathcal{D}_1\) there is a similar tuple in \(\mathcal{D}_2\) and vice 
versa. If \(\mathbf{L}\) is a two-element Boolean algebra and each \(\approx_y\) is an identity, then 
\(E^{\approx}(\mathcal{D}_1, \mathcal{D}_2) = 1\) iff \(\mathcal{D}_1\) and \(\mathcal{D}_2\) are identical (in the usual sense). 



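For tuple-based inclusion (similarity) and for certain relational operations, 
we can prove analogous preservation formulas as in Section 4.1. For instance, 

\[ S^{\approx}(\mathcal{D}_1, \mathcal{D}'_1) \wedge S^{\approx}(\mathcal{D}_2, \mathcal{D}'_2) \leq S^{\approx}(\mathcal{D}_1 \cup \mathcal{D}_2, \mathcal{D}'_1 \cup \mathcal{D}'_2), \tag{17} \]
\[ S^{\approx}(\mathcal{D}_1, \mathcal{D}'_1) \otimes S^{\approx}(\mathcal{D}_2, \mathcal{D}'_2) \leq S^{\approx}(\mathcal{D}_1 \times \mathcal{D}_2, \mathcal{D}'_1 \times \mathcal{D}'_2), \tag{18} \]
\[ S^{\approx}(\mathcal{D}, \mathcal{D}') \leq S^{\approx}(\pi_R(\mathcal{D}), \pi_R(\mathcal{D}')). \tag{19} \]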

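On the other hand, similarity-based selection \(\sigma_{y \approx d}\) (and, as a consequence, 
similarity-based join \(\bowtie_{p \approx q}\)) does not preserve \(S^{\approx}\) in general, which can be seen 
as a technical complication. This issue can be overcome by introducing a new 
type of selection \(\sigma^{\approx}_{y \approx d}\) which is compatible with \(S^{\approx}\). Namely, we can define 

\[ \bigl(\sigma^{\approx}_{y \approx d}(\mathcal{D})\bigr)(t) = \bigvee_{t' \in \mathrm{Tupl}(T)} \bigl( \mathcal{D}(t') \otimes (t' \approx t) \otimes (t(y) \approx_y d) \bigr). \tag{20} \]

For this notion, we can prove that \(S^{\approx}(\mathcal{D}, \mathcal{D}') \leq S^{\approx}(\sigma^{\approx}_{y \approx d}(\mathcal{D}), \sigma^{\approx}_{y \approx d}(\mathcal{D}'))\). A similar 
extension can be made for any relational operation which does not preserve \(S^{\approx}\) 
directly. A detailed description of the extension is postponed to a full version of 
the paper because of the limited scope. 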
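
A small numerical sketch may help to see the difference between the two selections. It assumes the Łukasiewicz operations, a hypothetical linear similarity on prices, a two-tuple universe standing in for Tupl(T), and the ordinary similarity-based selection defined by \(\mathcal{D}(t) \otimes (t(y) \approx_y d)\); all names and data are illustrative only.

```python
def t_norm(a, b):
    # Lukasiewicz conjunction, an assumed choice for ⊗ on [0, 1].
    return max(0.0, a + b - 1.0)

def residuum(a, b):
    # Lukasiewicz residuum (implication →).
    return min(1.0, 1.0 - a + b)

def attr_sim(x, y, scale=100_000):
    # Hypothetical attribute similarities: linear for numbers, identity otherwise.
    if isinstance(x, (int, float)) and isinstance(y, (int, float)):
        return max(0.0, 1.0 - abs(x - y) / scale)
    return 1.0 if x == y else 0.0

def tuple_sim(t, u):
    return min(attr_sim(x, y) for x, y in zip(t, u))

def inclusion(d1, d2, universe):
    # Tuple-based inclusion (16), with infimum and supremum over a finite universe.
    def best_match(t):
        return max(t_norm(d2.get(u, 0.0), tuple_sim(t, u)) for u in universe)
    return min(residuum(d1.get(t, 0.0), best_match(t)) for t in universe)

def select_plain(d, idx, value):
    # Assumed ordinary selection: rank D(t) ⊗ (t(y) ≈ d), defined on tuples of D only.
    return {t: t_norm(r, attr_sim(t[idx], value)) for t, r in d.items()}

def select_compatible(d, idx, value, universe):
    # Selection (20): supremum over t' of D(t') ⊗ (t' ≈ t) ⊗ (t(y) ≈ d).
    return {
        t: max((t_norm(t_norm(r, tuple_sim(u, t)), attr_sim(t[idx], value))
                for u, r in d.items()), default=0.0)
        for t in universe
    }

# Hypothetical finite universe of tuples and two one-tuple RDTs differing by one dollar.
universe = [("Brown", 228_500), ("Brown", 228_501)]
d_old = {universe[0]: 1.0}
d_new = {universe[1]: 1.0}

before = inclusion(d_old, d_new, universe)
plain = inclusion(select_plain(d_old, 1, 228_500),
                  select_plain(d_new, 1, 228_500), universe)
compatible = inclusion(select_compatible(d_old, 1, 228_500, universe),
                       select_compatible(d_new, 1, 228_500, universe), universe)
print(before, plain, compatible)
```

In this toy setting the inclusion degree drops after the ordinary selection but does not drop after the selection defined by (20), which is the behaviour the inequality above describes.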

4.4 Unifying Approach to Similarity of RDTs 

In this section, we outline a general approach to similarity of RDTs that 
includes both the approaches from the previous sections. Interestingly, both (3) 
and (16) have a common generalization using truth-stressing hedges [19,21]. 
Truth-stressing hedges represent unary operations on complete residuated 
lattices (denoted by \({}^{*}\)) that serve as interpretations of logical connectives like "very 
true", see [H]. Two boundary cases of hedges are (i) identity, i.e. \(a^{*} = a\) (\(a \in L\)); 
(ii) globalization: \(1^{*} = 1\), and \(a^{*} = 0\) if \(a < 1\). The globalization [JT] is a hedge 
which can be interpreted as "fully true". 

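Let \({}^{*}\) be a truth-stressing hedge on \(\mathbf{L}\). For RDTs \(\mathcal{D}_1, \mathcal{D}_2\) on \(T\), we define the 
degree \(S^{\approx}_{*}(\mathcal{D}_1, \mathcal{D}_2)\) of inclusion of \(\mathcal{D}_1\) in \(\mathcal{D}_2\) (with respect to \({}^{*}\)) by 

\[ S^{\approx}_{*}(\mathcal{D}_1, \mathcal{D}_2) = \bigwedge_{t \in \mathrm{Tupl}(T)} \Bigl( \mathcal{D}_1(t) \rightarrow \bigvee_{t' \in \mathrm{Tupl}(T)} \bigl( \mathcal{D}_2(t') \otimes (t \approx t')^{*} \bigr) \Bigr). \tag{21} \]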

Now, it is easily seen that for \({}^{*}\) being the identity, (21) coincides with (16); if \(\approx\) 
is separating (i.e., \(t_1 \approx t_2 = 1\) iff \(t_1\) is identical to \(t_2\)) and \({}^{*}\) is the globalization, 
(21) coincides with (3). Thus, both (3) and (16) are particular instances of (21) 
resulting from a choice of the hedge. Note that identity and globalization are two 
borderline cases of hedges. In general, complete residuated lattices admit other 
nontrivial hedges that can be used in (21). Therefore, the hedge in (21) serves as 
a parameter that has an influence on how much emphasis we put on the fact that 
two tuples are similar. In case of globalization, we put full emphasis, i.e., the 
tuples are required to be equal to degree 1 (exactly the same if \(\approx\) is separating). 

If we consider properties needed to prove analogous estimation formulas for 
general \(S^{\approx}_{*}\) as we did in case of \(S\) and \(S^{\approx}\), we come up with the following 
important property: 

\[ (r \approx s)^{*} \otimes (s \approx t)^{*} \leq (r \approx t)^{*}, \tag{22} \]

for every \(r, s, t \in \mathrm{Tupl}(T)\), which can be seen as transitivity of \(\approx\) with respect to 
\(\otimes\) and \({}^{*}\). Consider the following two cases in which (22) is satisfied: 

Case 1: \({}^{*}\) is globalization and \(\approx\) is separating. If the left-hand side of (22) is 
nonzero, then \(r \approx s = 1\) and \(s \approx t = 1\). Separability implies \(r = s = t\), 
i.e. \((r \approx t)^{*} = 1^{*} = 1\), verifying (22). 

Case 2: \(\approx\) is transitive. In this case, since \(a^{*} \otimes b^{*} \leq (a \otimes b)^{*}\) (follows from 
properties of hedges by standard arguments), transitivity of \(\approx\) and monotony 
of \({}^{*}\) yield \((r \approx s)^{*} \otimes (s \approx t)^{*} \leq ((r \approx s) \otimes (s \approx t))^{*} \leq (r \approx t)^{*}\). 
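
The inequality \(a^{*} \otimes b^{*} \leq (a \otimes b)^{*}\) used in Case 2 can be derived in a few steps. The sketch below assumes the axioms of truth-stressing hedges that are common in the cited literature but are not restated in this paper, namely \(1^{*} = 1\), \(a^{*} \leq a\), \((a \rightarrow b)^{*} \leq a^{*} \rightarrow b^{*}\), and \(a^{**} = a^{*}\), together with the adjointness of \(\otimes\) and \(\rightarrow\).

```latex
\begin{align*}
a \otimes b \le a \otimes b
  &\;\Longrightarrow\; a \le b \rightarrow (a \otimes b)
  && \text{(adjointness)} \\
  &\;\Longrightarrow\; a^{*} \le \bigl(b \rightarrow (a \otimes b)\bigr)^{*}
  && \text{(monotony of ${}^{*}$)} \\
  &\;\Longrightarrow\; a^{*} \le b^{*} \rightarrow (a \otimes b)^{*}
  && \text{(since $(x \rightarrow y)^{*} \le x^{*} \rightarrow y^{*}$)} \\
  &\;\Longrightarrow\; a^{*} \otimes b^{*} \le (a \otimes b)^{*}
  && \text{(adjointness)}
\end{align*}
```

Monotony of \({}^{*}\) itself follows from the same axioms: if \(x \leq y\), then \(x \rightarrow y = 1\), hence \(1 = 1^{*} = (x \rightarrow y)^{*} \leq x^{*} \rightarrow y^{*}\), which gives \(x^{*} \leq y^{*}\).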

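The following lemma shows that \(S^{\approx}_{*}\) and consequently \(E^{\approx}_{*}\) have properties 
that are considered natural for (degrees of) inclusion and similarity: 

Lemma 1. If \(\approx\) satisfies (22) with respect to \({}^{*}\), then 

(i) \(S^{\approx}_{*}\) is a reflexive and transitive \(\mathbf{L}\)-relation, i.e. an \(\mathbf{L}\)-quasiorder. 

(ii) \(E^{\approx}_{*}\) defined by \(E^{\approx}_{*}(\mathcal{D}_1, \mathcal{D}_2) = S^{\approx}_{*}(\mathcal{D}_1, \mathcal{D}_2) \wedge S^{\approx}_{*}(\mathcal{D}_2, \mathcal{D}_1)\) is a reflexive, 
symmetric, and transitive \(\mathbf{L}\)-relation, i.e. an \(\mathbf{L}\)-equivalence. 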

Proof. The assertion follows from results in [2, Section 4.2] by taking into account 
that \(\approx^{*}\) is reflexive, symmetric, and transitive with respect to \(\otimes\). □ 

5 Conclusion and Future Research 

We have shown that an important fragment of relational operations in 
similarity-based databases preserves various types of similarity. As a result, similarity of 
query results based on these relational operations can be estimated based on 
similarity of input data tables before the queries are executed. Furthermore, the 
results of this paper have shown a desirable property of the underlying 
similarity-based model of data: slight changes in input data do not produce huge 
changes in query results. Future research will focus on the role of particular 
relational operations called similarity-based closures that play an important role 
in tuple-based similarities of RDTs. An outline of results in this direction is 
presented in [3]. 
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